In this project, a database containing the health status, as well as many other related factors, for all countries will be analyzed.

Firstly, an initial exploratory analysis of the data will be conducted to identify potential missing values and outliers, and corresponding decisions will be made to address them.

Secondly, a Principal Component Analysis (PCA) will be performed. This involves condensing the information provided by the original variables into a few linear combinations, aiming to achieve dimensionality reduction. These combinations seek maximum variance and are perpendicular to each other, capturing the directions in which observations vary the most and are uncorrelated with each other.

Thirdly, a Factor Analysis (FA) will be carried out, where latent variables (unobservable) with a high correlation to a group of observable variables and virtually no correlation with the rest will be identified. This results in further dimensionality reduction.

Finally, Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) will be conducted, verifying the necessary assumptions of normality. Discriminant Analysis is a method for classifying qualitative variables, allowing the classification of new observations based on their characteristics (explanatory or predictor variables) into different categories of the qualitative response variable.

1 Loading packages and the dataset

Loading/installation of R packages necessary for this practice.

#install.packages("readr")
#Package required to call the 'read.csv' function.
library(readr)

#install.packages("pastecs")
#Package required to call the 'stat.desc' function.
library(pastecs)

#install.packages("summarytools") 
#Package required to call the 'freq' function.
library(summarytools)

#install.packages("MASS")
#Package required to call the 'lda' function.
library(MASS)

#install.packages("psych")
#Package required to call the 'cortest.bartlett' function.
library(psych)

#install.packages("polycor")
#Package required to call the 'hetcor' function.
library(polycor)

#install.packages("ggcorrplot")
#Package required to call the 'ggcorrplot' function.
library(ggcorrplot)

#install.packages("corrplot")
#Package required to call the 'corrplot' function.
library(corrplot)

#install.packages("ggplot2")
#Package required to call the 'ggplot' function.
library(ggplot2)

#install.packages("stats")
#Package required to call the 'factanal' function.
library(stats)

#install.packages("ggpubr")
#Package required to call 'ggarrange' function.
library(ggpubr)

#install.packages("reshape2")
# Package required to call 'melt' function.
library(reshape2)

#install.packages("biotools")
#Package required to call 'boxM' function.
library(biotools)

#install.packages("energy")
#Package required to call 'mvnorm.etest' function.
library(energy)

#install.packages("MVN")
#Package required to call 'mvn' function.
library(MVN)

#install.packages("klaR")
#Package required to call 'partimat' function.
library(klaR)

1.1 Description of the dataset

The dataset is taken from the Global Health Observatory (GHO) data repository under World Health Organization (WHO) keeps track of the health status as well as many other related factors for all countries. The data is from year 2000-2015 for 193 countries. It contains the following variables:

  • Country: Country Name.
  • Year: Year.
  • Status: Developed or Developing.
  • Life Expectancy: Life expectancy in age.
  • Adult Mortality: Probability of dying between 15 and 60 years per 1000 population.
  • Infant Deaths: Number of infant deaths per 1000 population.
  • Alcohol: Recorded per capita consumption (in litres).
  • Percentage Expenditure: Expenditure on health as per GDP (%).
  • Hepatitis B: Immunization coverage among 1-year-olds (%).
  • Measles: Number of reported cases per 1000 population.
  • BMI: Average BMI of the entire population.
  • Under-Five Deaths: Number of under-five deaths per 1000 population.
  • Polio Immunization: Coverage among one-year-olds (%).
  • Total Expenditure: Government expenditure on health as a percentage of total government expenditure (%).
  • Diphtheria: Immunization coverage among one-year-olds (%).
  • HIV/AIDS: Deaths per 1000 population.
  • GDP: Gross Domestic Product (GDP) per capita (USD).
  • Population: Population of the country.
  • Thinness 10-19 Years: Thinness among children from age 10-19 (%).
  • Thinness 5-9 Years: Thinness among children from age 5-9 (%).
  • Resource Income: Index ranging from 0-1.
  • Schooling: Number of years of schooling.

The aim is to identify how those factors affect life expectancy, such factors being demographic variables, income composition, mortality rates, immunization, human development index, social and economic factors.

1.2 Loading the dataset

To load our data set we will use the function read.csv.

#Set a CRAN mirror.
options(repos = c(CRAN = "https://cloud.r-project.org"))

#Loading the data set, expect.
expect<-read_csv("LED.csv")
nuevos_nombres <- c("Country", "Year", "Status", "Life expectancy", "Adult Mortality", "infant deaths", "Alcohol", "percentage expenditure", "Hepatitis B", "Measles", "BMI", "under-five deaths", "Polio", "Total expenditure", "Diphtheria", "HIV/AIDS", "GDP", "Population", "thinness  1-19 years", "thinness 5-9 years", "Resource Income", "Schooling")
names(expect) <- nuevos_nombres
class(expect) #it is a data frame.
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
colnames(expect) #to view the different variables previously mentioned.
##  [1] "Country"                "Year"                   "Status"                
##  [4] "Life expectancy"        "Adult Mortality"        "infant deaths"         
##  [7] "Alcohol"                "percentage expenditure" "Hepatitis B"           
## [10] "Measles"                "BMI"                    "under-five deaths"     
## [13] "Polio"                  "Total expenditure"      "Diphtheria"            
## [16] "HIV/AIDS"               "GDP"                    "Population"            
## [19] "thinness  1-19 years"   "thinness 5-9 years"     "Resource Income"       
## [22] "Schooling"

2 Preliminary exploratory analysis of the data

2.1 Identification and treatment of missing values (NA).

Now we identify the NA values and create a vector with the variables that have more than 5% of NA. For that we create a function called percentageNA and we use an apply function to apply it to the data frame

#We create a function to know which variables have more than 5% missing values.
percentageNA<-function(data){
  c=(sum(is.na(data)))/length(data)*100
  return(c)
}

#Next, we apply the previous function to our dataframe.
contNA<-apply(expect,2,percentageNA)
contNA
##                Country                   Year                 Status 
##              0.0000000              0.0000000              0.0000000 
##        Life expectancy        Adult Mortality          infant deaths 
##              0.3403676              0.3403676              0.0000000 
##                Alcohol percentage expenditure            Hepatitis B 
##              6.6031314              0.0000000             18.8223281 
##                Measles                    BMI      under-five deaths 
##              0.0000000              1.1572498              0.0000000 
##                  Polio      Total expenditure             Diphtheria 
##              0.6466984              7.6923077              0.6466984 
##               HIV/AIDS                    GDP             Population 
##              0.0000000             15.2484683             22.1919673 
##   thinness  1-19 years     thinness 5-9 years        Resource Income 
##              1.1572498              1.1572498              5.6841389 
##              Schooling 
##              5.5479918
#Now, we group the variables with more than 5% missing values.
missing_percentage <- colMeans(is.na(expect))
variables_with_missing <- missing_percentage[missing_percentage > 0.05]
variables_with_missing
##           Alcohol       Hepatitis B Total expenditure               GDP 
##        0.06603131        0.18822328        0.07692308        0.15248468 
##        Population   Resource Income         Schooling 
##        0.22191967        0.05684139        0.05547992

As we can see, the variables with more than 5% of missing values are Alcohol, Hepatitis B, Total expenditure, GDP, Population, Resource Income and Schooling.

Then, we see which of them are numeric.

is_numeric_vector <- sapply(variables_with_missing, is.numeric)
is_numeric_vector
##           Alcohol       Hepatitis B Total expenditure               GDP 
##              TRUE              TRUE              TRUE              TRUE 
##        Population   Resource Income         Schooling 
##              TRUE              TRUE              TRUE

All of the variables with missing values are numeric. Of the variables that have more than 5% missing values we will analyze whether they follow a random pattern. To do this, we’ll study homogeneity according to groups (NA and non-NA) with other variables. To assess whether the means of the variable in the NA group and the non-NA group are significantly different, we’ll perform a statistical test, the Student’s t-test is commonly used for this purpos and it’s the one we will use as our variables are continuous.

Firstly, we’ll define new variables that take the value of 0 if the corresponding data is not missing and the value of 1 if the corresponding data is missing.

Subsequently, a test of the equality of means is conducted between the groups of variable Measles (which has no missing values) corresponding to 0 and 1 of these new variables.

#Defining a function to create the new variables.
indicadoraNA<-function(data,na.rm=F){
  data[!is.na(data)]<-0
  data[is.na(data)]<-1
  return(data)
}

#Preparing to perform the test, we use the previous function.
alcoholNA=indicadoraNA(expect$Alcohol)
hepBNA=indicadoraNA(expect$`Hepatitis B`) 
totalexpNA=indicadoraNA(expect$`Total expenditure`)
gdpNA=indicadoraNA(expect$GDP)
populationNA=indicadoraNA(expect$Population)
incomeNA=indicadoraNA(expect$`Resource Income`)
schoolingNA=indicadoraNA(expect$Schooling)

#Student's t test.
t.test(expect$Measles~alcoholNA, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  expect$Measles by alcoholNA
## t = 1.2405, df = 2936, p-value = 0.2149
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -613.5377 2726.9543
## sample estimates:
## mean in group 0 mean in group 1 
##        2489.368        1432.660
t.test(expect$Measles~hepBNA, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  expect$Measles by hepBNA
## t = -4.5361, df = 2936, p-value = 5.96e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -3504.629 -1389.222
## sample estimates:
## mean in group 0 mean in group 1 
##        1959.024        4405.949
t.test(expect$Measles~totalexpNA, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  expect$Measles by totalexpNA
## t = 1.1044, df = 2936, p-value = 0.2695
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -679.885 2433.465
## sample estimates:
## mean in group 0 mean in group 1 
##        2487.038        1610.248
t.test(expect$Measles~gdpNA, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  expect$Measles by gdpNA
## t = -0.70519, df = 2936, p-value = 0.4807
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -1569.0568   738.9744
## sample estimates:
## mean in group 0 mean in group 1 
##        2356.305        2771.346
t.test(expect$Measles~populationNA, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  expect$Measles by populationNA
## t = 1.1423, df = 2936, p-value = 0.2534
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -416.6876 1579.7638
## sample estimates:
## mean in group 0 mean in group 1 
##        2548.647        1967.109
t.test(expect$Measles~incomeNA, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  expect$Measles by incomeNA
## t = -4.3771, df = 2936, p-value = 1.244e-05
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5773.207 -2201.069
## sample estimates:
## mean in group 0 mean in group 1 
##        2192.958        6180.096
t.test(expect$Measles~schoolingNA, var.equal=TRUE)
## 
##  Two Sample t-test
## 
## data:  expect$Measles by schoolingNA
## t = -4.4964, df = 2936, p-value = 7.18e-06
## alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
## 95 percent confidence interval:
##  -5948.182 -2335.733
## sample estimates:
## mean in group 0 mean in group 1 
##        2189.797        6331.755

Now we would examine the p-value in each case.

For the variables Alcohol, Total expenditure, GDP and Population p > 0.15, then there is no evidence to reject the null hypothesis of equal means. Therefore, the hypothesis of homogeneity is accepted, and it is concluded that the pattern is random. In the case of homogeneity the pattern is random and, in this case, it is chosen to replace the NA with the mean.

For the remaining variables that we are studying, which are Hepatitis B, Resource Income and Schooling, p < 0.15, so homogeneity cannot be assumed. In this case,this would have to be discussed with the researcher who poses the problem under analysis because they should neither be eliminated nor replaced, but since in this case it is not feasible, it is decided to act as in the case of a random pattern, so in this case, we replace them with the median.

In this case, since the variables are quantitative, missing values are replaced with the mean of the non-missing values.

# Define a function to replace NAs with the mean of each column. Creating the function that handles the missing values.
replace_na_with_mean <- function(column) {
  mean_value <- mean(column, na.rm = TRUE)
  column[is.na(column)] <- mean_value
  return(column)
}

# Apply the replacement function to each column of 'expect'.
expect_imputed <- as.data.frame(lapply(expect, replace_na_with_mean))

2.2 Classical numeric descriptive analysis

Now we do Classical numeric descriptive analysis (measures of central tendency, dispersion, quantiles, symmetry, kurtosis, etc.). As we know the only qualitative variables are country and status, so will analyze those one separately later.

First, we’ll do the quantitative ones. The stat.desc function from the pastecs package provides the most important measures of position, dispersion, and shape:

  • Number of values (nbr.val).
  • Number of null values (nbr.null).
  • Number of missing values (nbr.na).
  • Minimum value of the variable (min).
  • Maximum value of the variable (max).
  • Range of variable values (max-min) (range).
  • Sum of non-missing values (sum).
  • Median (median).
  • Mean (mean).
  • Standard error of the mean (SE.mean).
  • Confidence interval of the mean (CI.mean).
  • Variance (var).
  • Standard deviation (std.dev).
  • Coefficient of variation (coef.var).
  • Skewness coefficient (skewness).
  • Statistic to test if the skewness coefficient is zero (skew.2SE).
  • Kurtosis coefficient (kurtosis).
  • Statistic to test if the kurtosis coefficient is zero (kurt.2SE).
  • Shapiro-Wilks statistic (normtest.W).
  • Probability associated with the Shapiro-Wilks statistic (normtest.p).
#We care for the quantitative variables.
#We create a dataframe 'expect_num' where the NA have been substituted by the mean and the qualitative variables have been removed.
expect_num <- subset(expect_imputed, select = -c(Country, Status))


#Now we study the statistics with the function 'stat.desc'.
pastecs_stats <- stat.desc(expect_num) 
pastecs_stats

As this function doesn’t provide us with the quantiles we’ll use the function summary:

summary(expect_num)
##       Year      Life.expectancy Adult.Mortality infant.deaths   
##  Min.   :2000   Min.   :36.30   Min.   :  1.0   Min.   :   0.0  
##  1st Qu.:2004   1st Qu.:63.20   1st Qu.: 74.0   1st Qu.:   0.0  
##  Median :2008   Median :72.00   Median :144.0   Median :   3.0  
##  Mean   :2008   Mean   :69.22   Mean   :164.8   Mean   :  30.3  
##  3rd Qu.:2012   3rd Qu.:75.60   3rd Qu.:227.0   3rd Qu.:  22.0  
##  Max.   :2015   Max.   :89.00   Max.   :723.0   Max.   :1800.0  
##     Alcohol       percentage.expenditure  Hepatitis.B       Measles        
##  Min.   : 0.010   Min.   :    0.000      Min.   : 1.00   Min.   :     0.0  
##  1st Qu.: 1.093   1st Qu.:    4.685      1st Qu.:80.94   1st Qu.:     0.0  
##  Median : 4.160   Median :   64.913      Median :87.00   Median :    17.0  
##  Mean   : 4.603   Mean   :  738.251      Mean   :80.94   Mean   :  2419.6  
##  3rd Qu.: 7.390   3rd Qu.:  441.534      3rd Qu.:96.00   3rd Qu.:   360.2  
##  Max.   :17.870   Max.   :19479.912      Max.   :99.00   Max.   :212183.0  
##       BMI        under.five.deaths     Polio       Total.expenditure
##  Min.   : 1.00   Min.   :   0.00   Min.   : 3.00   Min.   : 0.370   
##  1st Qu.:19.40   1st Qu.:   0.00   1st Qu.:78.00   1st Qu.: 4.370   
##  Median :43.00   Median :   4.00   Median :93.00   Median : 5.938   
##  Mean   :38.32   Mean   :  42.04   Mean   :82.55   Mean   : 5.938   
##  3rd Qu.:56.10   3rd Qu.:  28.00   3rd Qu.:97.00   3rd Qu.: 7.330   
##  Max.   :87.30   Max.   :2500.00   Max.   :99.00   Max.   :17.600   
##    Diphtheria       HIV.AIDS           GDP              Population       
##  Min.   : 2.00   Min.   : 0.100   Min.   :     1.68   Min.   :3.400e+01  
##  1st Qu.:78.00   1st Qu.: 0.100   1st Qu.:   580.49   1st Qu.:4.189e+05  
##  Median :93.00   Median : 0.100   Median :  3116.56   Median :3.676e+06  
##  Mean   :82.32   Mean   : 1.742   Mean   :  7483.16   Mean   :1.275e+07  
##  3rd Qu.:97.00   3rd Qu.: 0.800   3rd Qu.:  7483.16   3rd Qu.:1.275e+07  
##  Max.   :99.00   Max.   :50.600   Max.   :119172.74   Max.   :1.294e+09  
##  thinness..1.19.years thinness.5.9.years Resource.Income    Schooling    
##  Min.   : 0.10        Min.   : 0.10      Min.   :0.0000   Min.   : 0.00  
##  1st Qu.: 1.60        1st Qu.: 1.60      1st Qu.:0.5042   1st Qu.:10.30  
##  Median : 3.40        Median : 3.40      Median :0.6620   Median :12.10  
##  Mean   : 4.84        Mean   : 4.87      Mean   :0.6276   Mean   :11.99  
##  3rd Qu.: 7.10        3rd Qu.: 7.20      3rd Qu.:0.7720   3rd Qu.:14.10  
##  Max.   :27.70        Max.   :28.60      Max.   :0.9480   Max.   :20.70

Now we care for the qualitative variables:

freq(expect_imputed$Country)
## Frequencies  
## expect_imputed$Country  
## Type: Character  
## 
##                                                              Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------------------------------------------------- ------ --------- -------------- --------- --------------
##                                                Afghanistan     16     0.545          0.545     0.545          0.545
##                                                    Albania     16     0.545          1.089     0.545          1.089
##                                                    Algeria     16     0.545          1.634     0.545          1.634
##                                                     Angola     16     0.545          2.178     0.545          2.178
##                                        Antigua and Barbuda     16     0.545          2.723     0.545          2.723
##                                                  Argentina     16     0.545          3.268     0.545          3.268
##                                                    Armenia     16     0.545          3.812     0.545          3.812
##                                                  Australia     16     0.545          4.357     0.545          4.357
##                                                    Austria     16     0.545          4.901     0.545          4.901
##                                                 Azerbaijan     16     0.545          5.446     0.545          5.446
##                                                    Bahamas     16     0.545          5.990     0.545          5.990
##                                                    Bahrain     16     0.545          6.535     0.545          6.535
##                                                 Bangladesh     16     0.545          7.080     0.545          7.080
##                                                   Barbados     16     0.545          7.624     0.545          7.624
##                                                    Belarus     16     0.545          8.169     0.545          8.169
##                                                    Belgium     16     0.545          8.713     0.545          8.713
##                                                     Belize     16     0.545          9.258     0.545          9.258
##                                                      Benin     16     0.545          9.803     0.545          9.803
##                                                     Bhutan     16     0.545         10.347     0.545         10.347
##                           Bolivia (Plurinational State of)     16     0.545         10.892     0.545         10.892
##                                     Bosnia and Herzegovina     16     0.545         11.436     0.545         11.436
##                                                   Botswana     16     0.545         11.981     0.545         11.981
##                                                     Brazil     16     0.545         12.526     0.545         12.526
##                                          Brunei Darussalam     16     0.545         13.070     0.545         13.070
##                                                   Bulgaria     16     0.545         13.615     0.545         13.615
##                                               Burkina Faso     16     0.545         14.159     0.545         14.159
##                                                    Burundi     16     0.545         14.704     0.545         14.704
##                                                 Cabo Verde     16     0.545         15.248     0.545         15.248
##                                                   Cambodia     16     0.545         15.793     0.545         15.793
##                                                   Cameroon     16     0.545         16.338     0.545         16.338
##                                                     Canada     16     0.545         16.882     0.545         16.882
##                                   Central African Republic     16     0.545         17.427     0.545         17.427
##                                                       Chad     16     0.545         17.971     0.545         17.971
##                                                      Chile     16     0.545         18.516     0.545         18.516
##                                                      China     16     0.545         19.061     0.545         19.061
##                                                   Colombia     16     0.545         19.605     0.545         19.605
##                                                    Comoros     16     0.545         20.150     0.545         20.150
##                                                      Congo     16     0.545         20.694     0.545         20.694
##                                               Cook Islands      1     0.034         20.728     0.034         20.728
##                                                 Costa Rica     16     0.545         21.273     0.545         21.273
##                                              Côte d'Ivoire     16     0.545         21.818     0.545         21.818
##                                                    Croatia     16     0.545         22.362     0.545         22.362
##                                                       Cuba     16     0.545         22.907     0.545         22.907
##                                                     Cyprus     16     0.545         23.451     0.545         23.451
##                                                    Czechia     16     0.545         23.996     0.545         23.996
##                      Democratic People's Republic of Korea     16     0.545         24.541     0.545         24.541
##                           Democratic Republic of the Congo     16     0.545         25.085     0.545         25.085
##                                                    Denmark     16     0.545         25.630     0.545         25.630
##                                                   Djibouti     16     0.545         26.174     0.545         26.174
##                                                   Dominica      1     0.034         26.208     0.034         26.208
##                                         Dominican Republic     16     0.545         26.753     0.545         26.753
##                                                    Ecuador     16     0.545         27.297     0.545         27.297
##                                                      Egypt     16     0.545         27.842     0.545         27.842
##                                                El Salvador     16     0.545         28.387     0.545         28.387
##                                          Equatorial Guinea     16     0.545         28.931     0.545         28.931
##                                                    Eritrea     16     0.545         29.476     0.545         29.476
##                                                    Estonia     16     0.545         30.020     0.545         30.020
##                                                   Ethiopia     16     0.545         30.565     0.545         30.565
##                                                       Fiji     16     0.545         31.110     0.545         31.110
##                                                    Finland     16     0.545         31.654     0.545         31.654
##                                                     France     16     0.545         32.199     0.545         32.199
##                                                      Gabon     16     0.545         32.743     0.545         32.743
##                                                     Gambia     16     0.545         33.288     0.545         33.288
##                                                    Georgia     16     0.545         33.833     0.545         33.833
##                                                    Germany     16     0.545         34.377     0.545         34.377
##                                                      Ghana     16     0.545         34.922     0.545         34.922
##                                                     Greece     16     0.545         35.466     0.545         35.466
##                                                    Grenada     16     0.545         36.011     0.545         36.011
##                                                  Guatemala     16     0.545         36.555     0.545         36.555
##                                                     Guinea     16     0.545         37.100     0.545         37.100
##                                              Guinea-Bissau     16     0.545         37.645     0.545         37.645
##                                                     Guyana     16     0.545         38.189     0.545         38.189
##                                                      Haiti     16     0.545         38.734     0.545         38.734
##                                                   Honduras     16     0.545         39.278     0.545         39.278
##                                                    Hungary     16     0.545         39.823     0.545         39.823
##                                                    Iceland     16     0.545         40.368     0.545         40.368
##                                                      India     16     0.545         40.912     0.545         40.912
##                                                  Indonesia     16     0.545         41.457     0.545         41.457
##                                 Iran (Islamic Republic of)     16     0.545         42.001     0.545         42.001
##                                                       Iraq     16     0.545         42.546     0.545         42.546
##                                                    Ireland     16     0.545         43.091     0.545         43.091
##                                                     Israel     16     0.545         43.635     0.545         43.635
##                                                      Italy     16     0.545         44.180     0.545         44.180
##                                                    Jamaica     16     0.545         44.724     0.545         44.724
##                                                      Japan     16     0.545         45.269     0.545         45.269
##                                                     Jordan     16     0.545         45.813     0.545         45.813
##                                                 Kazakhstan     16     0.545         46.358     0.545         46.358
##                                                      Kenya     16     0.545         46.903     0.545         46.903
##                                                   Kiribati     16     0.545         47.447     0.545         47.447
##                                                     Kuwait     16     0.545         47.992     0.545         47.992
##                                                 Kyrgyzstan     16     0.545         48.536     0.545         48.536
##                           Lao People's Democratic Republic     16     0.545         49.081     0.545         49.081
##                                                     Latvia     16     0.545         49.626     0.545         49.626
##                                                    Lebanon     16     0.545         50.170     0.545         50.170
##                                                    Lesotho     16     0.545         50.715     0.545         50.715
##                                                    Liberia     16     0.545         51.259     0.545         51.259
##                                                      Libya     16     0.545         51.804     0.545         51.804
##                                                  Lithuania     16     0.545         52.349     0.545         52.349
##                                                 Luxembourg     16     0.545         52.893     0.545         52.893
##                                                 Madagascar     16     0.545         53.438     0.545         53.438
##                                                     Malawi     16     0.545         53.982     0.545         53.982
##                                                   Malaysia     16     0.545         54.527     0.545         54.527
##                                                   Maldives     16     0.545         55.071     0.545         55.071
##                                                       Mali     16     0.545         55.616     0.545         55.616
##                                                      Malta     16     0.545         56.161     0.545         56.161
##                                           Marshall Islands      1     0.034         56.195     0.034         56.195
##                                                 Mauritania     16     0.545         56.739     0.545         56.739
##                                                  Mauritius     16     0.545         57.284     0.545         57.284
##                                                     Mexico     16     0.545         57.828     0.545         57.828
##                           Micronesia (Federated States of)     16     0.545         58.373     0.545         58.373
##                                                     Monaco      1     0.034         58.407     0.034         58.407
##                                                   Mongolia     16     0.545         58.952     0.545         58.952
##                                                 Montenegro     16     0.545         59.496     0.545         59.496
##                                                    Morocco     16     0.545         60.041     0.545         60.041
##                                                 Mozambique     16     0.545         60.585     0.545         60.585
##                                                    Myanmar     16     0.545         61.130     0.545         61.130
##                                                    Namibia     16     0.545         61.675     0.545         61.675
##                                                      Nauru      1     0.034         61.709     0.034         61.709
##                                                      Nepal     16     0.545         62.253     0.545         62.253
##                                                Netherlands     16     0.545         62.798     0.545         62.798
##                                                New Zealand     16     0.545         63.342     0.545         63.342
##                                                  Nicaragua     16     0.545         63.887     0.545         63.887
##                                                      Niger     16     0.545         64.432     0.545         64.432
##                                                    Nigeria     16     0.545         64.976     0.545         64.976
##                                                       Niue      1     0.034         65.010     0.034         65.010
##                                                     Norway     16     0.545         65.555     0.545         65.555
##                                                       Oman     16     0.545         66.099     0.545         66.099
##                                                   Pakistan     16     0.545         66.644     0.545         66.644
##                                                      Palau      1     0.034         66.678     0.034         66.678
##                                                     Panama     16     0.545         67.223     0.545         67.223
##                                           Papua New Guinea     16     0.545         67.767     0.545         67.767
##                                                   Paraguay     16     0.545         68.312     0.545         68.312
##                                                       Peru     16     0.545         68.856     0.545         68.856
##                                                Philippines     16     0.545         69.401     0.545         69.401
##                                                     Poland     16     0.545         69.946     0.545         69.946
##                                                   Portugal     16     0.545         70.490     0.545         70.490
##                                                      Qatar     16     0.545         71.035     0.545         71.035
##                                          Republic of Korea     16     0.545         71.579     0.545         71.579
##                                        Republic of Moldova     16     0.545         72.124     0.545         72.124
##                                                    Romania     16     0.545         72.668     0.545         72.668
##                                         Russian Federation     16     0.545         73.213     0.545         73.213
##                                                     Rwanda     16     0.545         73.758     0.545         73.758
##                                      Saint Kitts and Nevis      1     0.034         73.792     0.034         73.792
##                                                Saint Lucia     16     0.545         74.336     0.545         74.336
##                           Saint Vincent and the Grenadines     16     0.545         74.881     0.545         74.881
##                                                      Samoa     16     0.545         75.425     0.545         75.425
##                                                 San Marino      1     0.034         75.459     0.034         75.459
##                                      Sao Tome and Principe     16     0.545         76.004     0.545         76.004
##                                               Saudi Arabia     16     0.545         76.549     0.545         76.549
##                                                    Senegal     16     0.545         77.093     0.545         77.093
##                                                     Serbia     16     0.545         77.638     0.545         77.638
##                                                 Seychelles     16     0.545         78.182     0.545         78.182
##                                               Sierra Leone     16     0.545         78.727     0.545         78.727
##                                                  Singapore     16     0.545         79.272     0.545         79.272
##                                                   Slovakia     16     0.545         79.816     0.545         79.816
##                                                   Slovenia     16     0.545         80.361     0.545         80.361
##                                            Solomon Islands     16     0.545         80.905     0.545         80.905
##                                                    Somalia     16     0.545         81.450     0.545         81.450
##                                               South Africa     16     0.545         81.995     0.545         81.995
##                                                South Sudan     16     0.545         82.539     0.545         82.539
##                                                      Spain     16     0.545         83.084     0.545         83.084
##                                                  Sri Lanka     16     0.545         83.628     0.545         83.628
##                                                      Sudan     16     0.545         84.173     0.545         84.173
##                                                   Suriname     16     0.545         84.717     0.545         84.717
##                                                  Swaziland     16     0.545         85.262     0.545         85.262
##                                                     Sweden     16     0.545         85.807     0.545         85.807
##                                                Switzerland     16     0.545         86.351     0.545         86.351
##                                       Syrian Arab Republic     16     0.545         86.896     0.545         86.896
##                                                 Tajikistan     16     0.545         87.440     0.545         87.440
##                                                   Thailand     16     0.545         87.985     0.545         87.985
##                  The former Yugoslav republic of Macedonia     16     0.545         88.530     0.545         88.530
##                                                Timor-Leste     16     0.545         89.074     0.545         89.074
##                                                       Togo     16     0.545         89.619     0.545         89.619
##                                                      Tonga     16     0.545         90.163     0.545         90.163
##                                        Trinidad and Tobago     16     0.545         90.708     0.545         90.708
##                                                    Tunisia     16     0.545         91.253     0.545         91.253
##                                                     Turkey     16     0.545         91.797     0.545         91.797
##                                               Turkmenistan     16     0.545         92.342     0.545         92.342
##                                                     Tuvalu      1     0.034         92.376     0.034         92.376
##                                                     Uganda     16     0.545         92.920     0.545         92.920
##                                                    Ukraine     16     0.545         93.465     0.545         93.465
##                                       United Arab Emirates     16     0.545         94.010     0.545         94.010
##       United Kingdom of Great Britain and Northern Ireland     16     0.545         94.554     0.545         94.554
##                                United Republic of Tanzania     16     0.545         95.099     0.545         95.099
##                                   United States of America     16     0.545         95.643     0.545         95.643
##                                                    Uruguay     16     0.545         96.188     0.545         96.188
##                                                 Uzbekistan     16     0.545         96.732     0.545         96.732
##                                                    Vanuatu     16     0.545         97.277     0.545         97.277
##                         Venezuela (Bolivarian Republic of)     16     0.545         97.822     0.545         97.822
##                                                   Viet Nam     16     0.545         98.366     0.545         98.366
##                                                      Yemen     16     0.545         98.911     0.545         98.911
##                                                     Zambia     16     0.545         99.455     0.545         99.455
##                                                   Zimbabwe     16     0.545        100.000     0.545        100.000
##                                                       <NA>      0                              0.000        100.000
##                                                      Total   2938   100.000        100.000   100.000        100.000
freq(expect_imputed$Status)
## Frequencies  
## expect_imputed$Status  
## Type: Character  
## 
##                    Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ---------------- ------ --------- -------------- --------- --------------
##        Developed    512     17.43          17.43     17.43          17.43
##       Developing   2426     82.57         100.00     82.57         100.00
##             <NA>      0                               0.00         100.00
##            Total   2938    100.00         100.00    100.00         100.00

In the variable Country we can observe that a smaller sample has been taken for some countries, which are: Cook Islands, Dominica, Marshall Islands, Monaco, Nauru, Nive, Palau, Saint Kitts and Nevis, San Marino and Tuvalu. This is made as these countries have a smaller population.

2.3 Extreme values (outliers)

To detect outliers, a preliminary exploratory analysis is required by constructing boxplots for all variables. The following visualization is not the best due to the difference between the scales.

boxplot(expect_num[-1],main="Outliers",
        xlab="All explanatory variables",
        ylab="Values",
        col=c(1:11),
        las = 2)

The following visualization is not affected by the difference between the scales as if the quantitative variables are standardized, the effect of scales differences is erased

sca<-scale(expect_num[-1])
boxplot(sca,main="Outliers",
        xlab="All explanatory variables",
        ylab="Values",
        col=c(1:11))

As we can see there are many outliers, so we could not eliminate them. We’ll have to substitute them for the mean.

Since all variables are quantitative, the decision has been made to replace them with the mean. To achieve this, the “outlier” function has been developed to perform the detection and manipulation of outliers. It is worth noting that the decision to replace outliers with the mean would necessitate a prior analysis of the cause of these outlier values.

#Defining the outlier function.
outlier<-function(data,na.rm=T){
  H<-1.5*IQR(data, na.rm=T)
  data[data<quantile(data,0.25, na.rm = T)-H]<-NA
  data[data>quantile(data,0.75, na.rm = T)+H]<-NA
  data[is.na(data)]<-mean(data, na.rm = T)

  H<-1.5*IQR(data)

  if (TRUE %in% (data<quantile(data,0.25,na.rm = T)-H) |
      TRUE %in% (data>quantile(data,0.75,na.rm = T)+H))
    outlier(data)
  else
    return(data)
}

# Application of the outlier function to substitute the outliers for the mean.
data<-as.data.frame(apply(expect_num,2,outlier))

# Lastly, a comparison between the original data and the corrected data with the respective boxplots.
par(mfrow=c(1,2))

# Boxplot of the original data.
boxplot(expect_num,main="Original data",
        xlab="Explanatory variables",
        ylab="Values",
        col=c(1:ncol(expect_num)))

# Boxplot of the corrected data.
boxplot(data,main="Data without outliers",
        xlab="Explanatory variables",
        ylab="Values",
        col=c(1:ncol(data)))

As we can see, the data is not scaled, so now we’ll see the boxplots with scaled data.

#We apply the function to substitute the outliers.
data_norm<-as.data.frame(apply(sca,2,outlier))

#Lastly, a comparison between the original data and the corrected data with the respective boxplots.
par(mfrow=c(1,2))

# Boxplot of the original data.
boxplot(sca,main="Original data",
        xlab="Explanatory variables",
        ylab="Values",
        col=c(1:ncol(expect_num)))

# Boxplot of the corrected data.
boxplot(data_norm,main="Data without outliers",
        xlab="Explanatory variables",
        ylab="Values",
        col=c(1:ncol(expect_num)))

3 Principal Component Analysis (PCA)

First, it is necessary to verify that the variables are not independent:

-At the level of the sample collected in the database, this can be done by calculating and observing the correlation matrix.

-At a population level, it will be justified that there is a correlation using ‘Bartlett’s Test of sphericity’. This test checks if the correlations are significantly different from 0.The null hypothesis is H_0: det(R) = 1; it means that the variables are not correlated, R denotes the correlation matrix. This function works with standardized data.

#Correlation matrix.
correlation_matrix<- cor(data)
correlation_matrix
##                               Year Life.expectancy Adult.Mortality
## Year                    1.00000000      0.15250262     -0.01657843
## Life.expectancy         0.15250262      1.00000000     -0.57411530
## Adult.Mortality        -0.01657843     -0.57411530      1.00000000
## infant.deaths          -0.03193416     -0.50566581      0.33970610
## Alcohol                -0.04925409      0.39542286     -0.21055256
## percentage.expenditure -0.04302649      0.30478901     -0.20762533
## Hepatitis.B             0.18727322      0.32680387     -0.20138411
## Measles                -0.05629980     -0.24356404      0.13748341
## BMI                     0.10832670      0.55644633     -0.36084781
## under.five.deaths      -0.02627466     -0.44562526      0.31075365
## Polio                   0.08075056      0.40103560     -0.25089144
## Total.expenditure       0.04554965      0.22408065     -0.16141448
## Diphtheria              0.09997122      0.40414736     -0.26518865
## HIV.AIDS               -0.04205825     -0.66178696      0.40502939
## GDP                     0.12427861      0.29125683     -0.12147221
## Population              0.01953851     -0.01060332      0.03390156
## thinness..1.19.years   -0.04775821     -0.60932003      0.35890225
## thinness.5.9.years     -0.03322319     -0.60668780      0.36984153
## Resource.Income         0.12855710      0.81448108     -0.49792606
## Schooling               0.13954171      0.67848169     -0.40594116
##                        infant.deaths     Alcohol percentage.expenditure
## Year                     -0.03193416 -0.04925409           -0.043026485
## Life.expectancy          -0.50566581  0.39542286            0.304789009
## Adult.Mortality           0.33970610 -0.21055256           -0.207625335
## infant.deaths             1.00000000 -0.33964546           -0.269115025
## Alcohol                  -0.33964546  1.00000000            0.189363467
## percentage.expenditure   -0.26911502  0.18936347            1.000000000
## Hepatitis.B              -0.28981084  0.11861959            0.129193575
## Measles                   0.38129436 -0.13953262           -0.148422850
## BMI                      -0.40463100  0.31656838            0.222071629
## under.five.deaths         0.69124728 -0.31255232           -0.212785308
## Polio                    -0.28412162  0.22006671            0.149810964
## Total.expenditure        -0.22908354  0.29892302            0.107914633
## Diphtheria               -0.29540580  0.22986006            0.154403667
## HIV.AIDS                  0.38581024 -0.18461498           -0.184295246
## GDP                      -0.15742062  0.24849679           -0.001114271
## Population                0.19347234 -0.06955884           -0.184717447
## thinness..1.19.years      0.35012332 -0.41234498           -0.224947396
## thinness.5.9.years        0.36226862 -0.40712881           -0.226128862
## Resource.Income          -0.50987566  0.49854360            0.360029134
## Schooling                -0.46652459  0.50180812            0.304087449
##                        Hepatitis.B     Measles         BMI under.five.deaths
## Year                    0.18727322 -0.05629980  0.10832670       -0.02627466
## Life.expectancy         0.32680387 -0.24356404  0.55644633       -0.44562526
## Adult.Mortality        -0.20138411  0.13748341 -0.36084781        0.31075365
## infant.deaths          -0.28981084  0.38129436 -0.40463100        0.69124728
## Alcohol                 0.11861959 -0.13953262  0.31656838       -0.31255232
## percentage.expenditure  0.12919358 -0.14842285  0.22207163       -0.21278531
## Hepatitis.B             1.00000000 -0.15603801  0.21003355       -0.21589883
## Measles                -0.15603801  1.00000000 -0.23745465        0.31577327
## BMI                     0.21003355 -0.23745465  1.00000000       -0.32887835
## under.five.deaths      -0.21589883  0.31577327 -0.32887835        1.00000000
## Polio                   0.49788803 -0.12400568  0.21561855       -0.24101112
## Total.expenditure       0.02487489 -0.14539433  0.20785582       -0.16825635
## Diphtheria              0.50428189 -0.12729666  0.21950700       -0.24348076
## HIV.AIDS               -0.22736643  0.17223318 -0.43900559        0.35721178
## GDP                     0.18512152 -0.08828064  0.25075566       -0.17347086
## Population              0.03654473  0.12717832 -0.01934333        0.13498898
## thinness..1.19.years   -0.12252064  0.22938223 -0.51910743        0.32586046
## thinness.5.9.years     -0.13242165  0.24123086 -0.52830322        0.32985369
## Resource.Income         0.32968635 -0.23665929  0.58175142       -0.46813662
## Schooling               0.28325557 -0.21253215  0.51536836       -0.41267225
##                              Polio Total.expenditure  Diphtheria    HIV.AIDS
## Year                    0.08075056        0.04554965  0.09997122 -0.04205825
## Life.expectancy         0.40103560        0.22408065  0.40414736 -0.66178696
## Adult.Mortality        -0.25089144       -0.16141448 -0.26518865  0.40502939
## infant.deaths          -0.28412162       -0.22908354 -0.29540580  0.38581024
## Alcohol                 0.22006671        0.29892302  0.22986006 -0.18461498
## percentage.expenditure  0.14981096        0.10791463  0.15440367 -0.18429525
## Hepatitis.B             0.49788803        0.02487489  0.50428189 -0.22736643
## Measles                -0.12400568       -0.14539433 -0.12729666  0.17223318
## BMI                     0.21561855        0.20785582  0.21950700 -0.43900559
## under.five.deaths      -0.24101112       -0.16825635 -0.24348076  0.35721178
## Polio                   1.00000000        0.08557358  0.83402939 -0.34260272
## Total.expenditure       0.08557358        1.00000000  0.10029069 -0.11642070
## Diphtheria              0.83402939        0.10029069  1.00000000 -0.34559476
## HIV.AIDS               -0.34260272       -0.11642070 -0.34559476  1.00000000
## GDP                     0.18933416        0.08210680  0.18731256 -0.25862973
## Population              0.01198798       -0.14634868  0.03148769 -0.03028617
## thinness..1.19.years   -0.17506796       -0.31176037 -0.19632835  0.44316357
## thinness.5.9.years     -0.19532547       -0.32077192 -0.21411901  0.42309886
## Resource.Income         0.39473106        0.18272611  0.39137309 -0.56461340
## Schooling               0.37405864        0.25232368  0.36831656 -0.46831628
##                                 GDP  Population thinness..1.19.years
## Year                    0.124278608  0.01953851          -0.04775821
## Life.expectancy         0.291256832 -0.01060332          -0.60932003
## Adult.Mortality        -0.121472214  0.03390156           0.35890225
## infant.deaths          -0.157420621  0.19347234           0.35012332
## Alcohol                 0.248496790 -0.06955884          -0.41234498
## percentage.expenditure -0.001114271 -0.18471745          -0.22494740
## Hepatitis.B             0.185121522  0.03654473          -0.12252064
## Measles                -0.088280636  0.12717832           0.22938223
## BMI                     0.250755657 -0.01934333          -0.51910743
## under.five.deaths      -0.173470856  0.13498898           0.32586046
## Polio                   0.189334157  0.01198798          -0.17506796
## Total.expenditure       0.082106795 -0.14634868          -0.31176037
## Diphtheria              0.187312558  0.03148769          -0.19632835
## HIV.AIDS               -0.258629727 -0.03028617           0.44316357
## GDP                     1.000000000  0.28904708          -0.19123381
## Population              0.289047078  1.00000000           0.05586769
## thinness..1.19.years   -0.191233814  0.05586769           1.00000000
## thinness.5.9.years     -0.192708380  0.06125504           0.92154176
## Resource.Income         0.362835252 -0.03309412          -0.56734810
## Schooling               0.314214831 -0.05425340          -0.50681145
##                        thinness.5.9.years Resource.Income  Schooling
## Year                          -0.03322319      0.12855710  0.1395417
## Life.expectancy               -0.60668780      0.81448108  0.6784817
## Adult.Mortality                0.36984153     -0.49792606 -0.4059412
## infant.deaths                  0.36226862     -0.50987566 -0.4665246
## Alcohol                       -0.40712881      0.49854360  0.5018081
## percentage.expenditure        -0.22612886      0.36002913  0.3040874
## Hepatitis.B                   -0.13242165      0.32968635  0.2832556
## Measles                        0.24123086     -0.23665929 -0.2125321
## BMI                           -0.52830322      0.58175142  0.5153684
## under.five.deaths              0.32985369     -0.46813662 -0.4126722
## Polio                         -0.19532547      0.39473106  0.3740586
## Total.expenditure             -0.32077192      0.18272611  0.2523237
## Diphtheria                    -0.21411901      0.39137309  0.3683166
## HIV.AIDS                       0.42309886     -0.56461340 -0.4683163
## GDP                           -0.19270838      0.36283525  0.3142148
## Population                     0.06125504     -0.03309412 -0.0542534
## thinness..1.19.years           0.92154176     -0.56734810 -0.5068114
## thinness.5.9.years             1.00000000     -0.56094672 -0.5065113
## Resource.Income               -0.56094672      1.00000000  0.8229748
## Schooling                     -0.50651129      0.82297477  1.0000000
#Standardized data.
data_scaled<- scale(data)

#Bartlett's Test of sphericity.
cortest.bartlett(cor(data_scaled))
## $chisq
## [1] 961.4059
## 
## $p.value
## [1] 2.441366e-103
## 
## $df
## [1] 190
  1. $chisq: This value represents the chi-square statistic obtained from ‘Bartlett’s test of sphericity’. It measures the discrepancy between the observed correlation matrix and an identity matrix, i.e., a matrix in which all variables are uncorrelated. It is a measure of how much the observed correlations deviate from the expected correlations if the variables were uncorrelated; thus, a large value indicates that the variables are correlated.

  2. $p.value: This is the p-value associated with the chi-square statistic. It indicates the probability of obtaining a chi-square statistic as extreme as the calculated one (or more extreme) if the variables were truly uncorrelated. Since the p-value is extremely small, it provides strong evidence against the null hypothesis that the variables are not correlated. The null hypothesis is rejected, meaning that det(R) = 1; hence det(R) != 1.

  3. $df: These are the degrees of freedom associated with the chi-square statistic, in this case, they are 190.

After this, it is requested to carry out a study of the possibility of reducing the dimension through observable variables. It is convenient to choose the optimal number of principal components using different graphical techniques.

Firstly, we’ll study the polychoric correlation matrix.

poly_cor<- hetcor(expect_num)$correlations
ggcorrplot(poly_cor, type="lower",hc.order=T,  tl.cex = 10)

A polychoric correlation matrix evaluates the relationships between two or more variables. This matrix quantifies the relationship between pairs of variables and provides a measure of how these variables tend to move together or in opposite directions. Polychoric correlation values range from -1 to 1.

A positive value indicates a positive relationship, meaning that as the values of one variable increase, the values of the other tend to increase. Conversely, a negative value suggests a negative relationship, where an increase in the values of one variable is associated with a decrease in the values of the other. When the value approaches 0, it indicates a weak relationship between the variables. The magnitude of the value reflects the strength of the observed association. In this case, as the relationship takes on a redder hue, it indicates a more direct correlation between the variables. On the other hand, as it becomes bluer, it suggests a stronger negative correlation. When the color approaches white, it indicates that the variables have a weaker relationship.

#Another interesting representation would be the following one.
corrplot(correlation_matrix, order = "hclust", tl.col='black', tl.cex=0.7)

This is a graphical representation of the correlation matrix. The Pearson correlation matrix is computed for the data in our dataset, meaning pairwise correlations are calculated between all variables in the dataset, yielding a square correlation matrix. Positive correlations are depicted by an intense blue color, which is why the diagonal of the matrix is blue since all variables are entirely correlated with themselves. Negative correlations are indicated by an intense red color.

Observing the graphical representations, it appears that there are between 4 and 5 groups of latent variables highly correlated with one group of observable variables and weakly correlated with the rest. This provides an intuitive sense of the optimal number of factors for our Factor Analysis.

Based on the results obtained in the previous sections, we are ready to perform dimensionality reduction using Principal Component Analysis (PCA) or Factor Analysis, as the data is correlated, contains no missing values, and the presence of outliers has been avoided.

3.1 Implementation of Principal Component Analysis (PCA)

The following code performs PCA, obtaining the eigenvectors that generate each component, as well as their eigenvalues, which correspond to the variance of each component.

#Using prcomp to compute the principal components (eigenvalues and eigenvectors). 
#With scale=TRUE, variable means are set to zero, and variances set to one.
expect_pca <- prcomp(data,scale=TRUE)

#The "rotation" field of the "expect_pca" object is a matrix whose columns
#are the coefficients of the principal components, i.e., the
#weight of each variable in the corresponding principal component.
expect_pca$rotation
##                                PC1           PC2         PC3         PC4
## Year                   -0.05039560  0.1635392780 -0.11615031 -0.43419326
## Life.expectancy        -0.33036601 -0.0007331087 -0.10183706  0.03802842
## Adult.Mortality         0.22223435  0.0169965057  0.01445102 -0.14470781
## infant.deaths           0.25549725  0.0547922822 -0.32417394  0.30325989
## Alcohol                -0.21116083 -0.1321737482 -0.03421661  0.11471888
## percentage.expenditure -0.14737094 -0.0988024316  0.26515739  0.23292234
## Hepatitis.B            -0.16377588  0.4225957589  0.13271579 -0.06245054
## Measles                 0.14037039  0.0886268312 -0.27396652  0.45518813
## BMI                    -0.25820730 -0.1146238362 -0.13527631 -0.07385094
## under.five.deaths       0.23095279  0.0602532507 -0.27424724  0.34225124
## Polio                  -0.20190862  0.4786971839  0.14835802  0.24737499
## Total.expenditure      -0.12954884 -0.2216846278  0.06196571  0.06779824
## Diphtheria             -0.20545590  0.4743879267  0.13889388  0.24203717
## HIV.AIDS                0.25390036 -0.0545466009  0.11835201 -0.02977423
## GDP                    -0.14225915  0.1572181191 -0.40731197 -0.32399638
## Population              0.02898727  0.2360778373 -0.55146002 -0.12839398
## thinness..1.19.years    0.27376956  0.2798402880  0.19574895 -0.15381806
## thinness.5.9.years      0.27546647  0.2725863602  0.18013825 -0.15884438
## Resource.Income        -0.33351334 -0.0013924645 -0.09152134  0.01037712
## Schooling              -0.30567926 -0.0118349625 -0.06982695  0.01750321
##                                PC5          PC6         PC7         PC8
## Year                    0.29718445  0.661637939 -0.28330347 -0.10291026
## Life.expectancy        -0.15950003  0.132478949 -0.02364510 -0.09667275
## Adult.Mortality         0.31446611 -0.137454065 -0.07377528  0.38922069
## infant.deaths           0.09678651  0.181791990 -0.04174313  0.16406518
## Alcohol                 0.34299350 -0.412901663 -0.28134847 -0.02628749
## percentage.expenditure -0.19055406  0.049064971 -0.51180870  0.49353017
## Hepatitis.B             0.09188185  0.111336717  0.02851010  0.15639876
## Measles                -0.04983325  0.024776285 -0.34206666 -0.48582766
## BMI                    -0.06204044  0.126781094  0.01567170  0.10354635
## under.five.deaths       0.14827527  0.246299336 -0.04586746  0.24357189
## Polio                   0.14983740 -0.048873652  0.14190817  0.02652263
## Total.expenditure       0.60195405 -0.040231404  0.09841795 -0.30030032
## Diphtheria              0.16361939 -0.036229926  0.15681018  0.02834705
## HIV.AIDS                0.32842048 -0.090106059 -0.18405402  0.14697691
## GDP                     0.09800907 -0.339765315 -0.18328085  0.06966947
## Population             -0.12718281 -0.250535195  0.13229450  0.15466548
## thinness..1.19.years   -0.11563399 -0.142980560 -0.26641079 -0.20017103
## thinness.5.9.years     -0.12501773 -0.131968640 -0.28418414 -0.20589095
## Resource.Income        -0.10908327  0.002337521 -0.25590327 -0.01027066
## Schooling               0.03401024 -0.024270068 -0.30677835 -0.04714555
##                                PC9         PC10         PC11         PC12
## Year                    0.05936855 -0.079401382  0.023821523  0.298831272
## Life.expectancy        -0.03705585  0.076479408 -0.020004876  0.077097634
## Adult.Mortality         0.37616897 -0.296412110 -0.476945044 -0.025575316
## infant.deaths          -0.06406277  0.264736129 -0.036675577  0.015382992
## Alcohol                 0.26298836  0.221702937  0.256656505  0.135673548
## percentage.expenditure -0.39752397 -0.299278076 -0.061495628  0.121803736
## Hepatitis.B             0.01196103 -0.196458854  0.397705360 -0.564135424
## Measles                 0.21863596 -0.493725129 -0.035019466 -0.165005312
## BMI                     0.03448427  0.032832434 -0.039788432 -0.500995452
## under.five.deaths      -0.15219049  0.386971809 -0.038094806 -0.171131599
## Polio                   0.04559221  0.051736363 -0.149454886  0.156649052
## Total.expenditure      -0.59883409 -0.150285934 -0.121504583 -0.091403888
## Diphtheria              0.02638437  0.014306985 -0.108688002  0.184280295
## HIV.AIDS                0.07103142  0.006480065  0.538396775 -0.008838757
## GDP                    -0.15123464  0.067912604 -0.300102739 -0.247618245
## Population             -0.27691658 -0.290513699  0.317218284  0.325544622
## thinness..1.19.years   -0.19040089  0.195483658 -0.054497180 -0.070575560
## thinness.5.9.years     -0.17581079  0.190400672 -0.074237654 -0.051466917
## Resource.Income         0.08183350  0.161231231  0.004636026  0.016336700
## Schooling               0.10087218  0.207399033 -0.012251106  0.052363638
##                               PC13        PC14         PC15        PC16
## Year                    0.01225355  0.09607309  0.188071502  0.01947916
## Life.expectancy         0.01406141 -0.19354098  0.003276960  0.08889360
## Adult.Mortality        -0.26254004 -0.27504463 -0.094641078  0.01265549
## infant.deaths           0.06826052 -0.04982975  0.007183562 -0.75572983
## Alcohol                -0.07240371 -0.09273099  0.579762201 -0.01734889
## percentage.expenditure  0.06737358  0.09280136  0.125840916 -0.02242875
## Hepatitis.B             0.14526661 -0.41574455  0.073587810 -0.09980227
## Measles                 0.02898145  0.08914250 -0.012434305  0.07091039
## BMI                    -0.59019078  0.48155436  0.131824238 -0.10234442
## under.five.deaths      -0.05238575 -0.12418083  0.059860753  0.61419849
## Polio                  -0.04526815  0.21966450 -0.064415795 -0.01890306
## Total.expenditure      -0.13147338 -0.14267944 -0.087935911 -0.06140062
## Diphtheria             -0.04112841  0.22689628 -0.004193625  0.03762600
## HIV.AIDS                0.04982527  0.39639881 -0.433987090  0.06138024
## GDP                     0.51860951  0.25930036 -0.023012315  0.04320723
## Population             -0.34796672 -0.09606712 -0.013910461  0.04341353
## thinness..1.19.years   -0.21692368 -0.02034840  0.037180148 -0.01845632
## thinness.5.9.years     -0.21564206 -0.06987844  0.074687436 -0.03907094
## Resource.Income        -0.06191748 -0.15281685 -0.267838242 -0.03080529
## Schooling              -0.17982142 -0.21799842 -0.545392981 -0.03233309
##                                PC17          PC18         PC19          PC20
## Year                   -0.070556053 -0.0292078317  0.028457414 -1.724463e-02
## Life.expectancy         0.738177883  0.0356133583 -0.469071545  2.840843e-02
## Adult.Mortality         0.204128461  0.0284589990 -0.025531212 -1.173581e-02
## infant.deaths           0.079047447  0.0386073739 -0.040013419 -1.245200e-02
## Alcohol                -0.029213798 -0.0242018304 -0.033141027 -1.518932e-02
## percentage.expenditure -0.046416878 -0.0148282696 -0.040558753  3.722456e-05
## Hepatitis.B            -0.047531764 -0.0071804854 -0.013217699 -7.779322e-04
## Measles                 0.002696099 -0.0036986803 -0.003633262 -5.436664e-03
## BMI                     0.017229725  0.0074788883 -0.043872587  1.872838e-02
## under.five.deaths      -0.043080910 -0.0189290410  0.041804613  2.524983e-03
## Polio                   0.054642421 -0.7004222293  0.026961238  2.127814e-02
## Total.expenditure       0.057638804 -0.0007739644  0.070895675  1.560641e-03
## Diphtheria             -0.036664271  0.7062234145 -0.002644184  3.299977e-03
## HIV.AIDS                0.305711419  0.0192128199 -0.062335140  4.159963e-02
## GDP                     0.002753927  0.0089570525 -0.054900541  1.117356e-02
## Population             -0.032138654 -0.0299204213  0.021237575 -3.515843e-03
## thinness..1.19.years    0.083182767  0.0081735814 -0.069454869 -7.046161e-01
## thinness.5.9.years      0.072003239  0.0272959352  0.028708265  7.038266e-01
## Resource.Income         0.248607250  0.0527229940  0.775987881 -5.648313e-02
## Schooling              -0.467663707 -0.0036697603 -0.386509071  2.459497e-02
#In the "sdev" field of the "expect_pca" object, along with the summary function applied
#to the object, relevant information is obtained: standard deviations of
#each principal component, proportion of explained and cumulative variance.
expect_pca$sdev
##  [1] 2.6232180 1.3778737 1.2203912 1.0552838 1.0334747 1.0127769 0.9493830
##  [8] 0.8955225 0.8483023 0.7908919 0.7866141 0.7538461 0.7298716 0.6969106
## [15] 0.6265453 0.5387641 0.4753598 0.4047081 0.3468292 0.2769408
summary(expect_pca)
## Importance of components:
##                           PC1     PC2     PC3     PC4    PC5     PC6     PC7
## Standard deviation     2.6232 1.37787 1.22039 1.05528 1.0335 1.01278 0.94938
## Proportion of Variance 0.3441 0.09493 0.07447 0.05568 0.0534 0.05129 0.04507
## Cumulative Proportion  0.3441 0.43899 0.51346 0.56914 0.6225 0.67383 0.71890
##                           PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.8955 0.84830 0.79089 0.78661 0.75385 0.72987 0.69691
## Proportion of Variance 0.0401 0.03598 0.03128 0.03094 0.02841 0.02664 0.02428
## Cumulative Proportion  0.7590 0.79497 0.82625 0.85719 0.88560 0.91224 0.93652
##                           PC15    PC16   PC17    PC18    PC19    PC20
## Standard deviation     0.62655 0.53876 0.4754 0.40471 0.34683 0.27694
## Proportion of Variance 0.01963 0.01451 0.0113 0.00819 0.00601 0.00383
## Cumulative Proportion  0.95615 0.97066 0.9820 0.99015 0.99617 1.00000

The following code explains the behaviour of the explained variance for each principal component and the behaviour of the cumulative explained variance.

eigen_expect <- expect_pca$sdev^2

names(eigen_expect) <- paste("PC",1:20,sep="")
eigen_expect
##        PC1        PC2        PC3        PC4        PC5        PC6        PC7 
## 6.88127260 1.89853604 1.48935464 1.11362380 1.06806999 1.02571703 0.90132806 
##        PC8        PC9       PC10       PC11       PC12       PC13       PC14 
## 0.80196053 0.71961676 0.62551002 0.61876173 0.56828399 0.53271250 0.48568438 
##       PC15       PC16       PC17       PC18       PC19       PC20 
## 0.39255898 0.29026670 0.22596694 0.16378861 0.12029050 0.07669621
sumlambdas <- sum(eigen_expect)
sumlambdas
## [1] 20
#shows the percent variance each principal component holds
propvar <- eigen_expect/sumlambdas
propvar 
##         PC1         PC2         PC3         PC4         PC5         PC6 
## 0.344063630 0.094926802 0.074467732 0.055681190 0.053403500 0.051285851 
##         PC7         PC8         PC9        PC10        PC11        PC12 
## 0.045066403 0.040098026 0.035980838 0.031275501 0.030938087 0.028414199 
##        PC13        PC14        PC15        PC16        PC17        PC18 
## 0.026635625 0.024284219 0.019627949 0.014513335 0.011298347 0.008189431 
##        PC19        PC20 
## 0.006014525 0.003834811
#shows the cumulative percent variance 
percent <- cumsum(propvar)
percent 
##       PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8 
## 0.3440636 0.4389904 0.5134582 0.5691394 0.6225429 0.6738287 0.7188951 0.7589931 
##       PC9      PC10      PC11      PC12      PC13      PC14      PC15      PC16 
## 0.7949740 0.8262495 0.8571876 0.8856018 0.9122374 0.9365216 0.9561496 0.9706629 
##      PC17      PC18      PC19      PC20 
## 0.9819612 0.9901507 0.9961652 1.0000000

The following graphs explain the behaviour of the explained variance for each principal component and the behaviour of the cumulative explained variance.

#Proportion of the explained variance.
expl_var <- expect_pca$sdev^2 / sum(expect_pca$sdev^2)

ggplot(data = data.frame(expl_var, pc = 1:ncol(data)),
       aes(x = pc, y = expl_var, fill=expl_var )) +
  geom_col(width = 0.3) +
  scale_y_continuous(limits = c(0,0.6)) + theme_bw() +
  labs(x = "Principal component", y = "Proportion of explained variance")

#Proportion of the cumulative explained variance.
var_cum<-cumsum(expl_var)

ggplot( data = data.frame(var_cum, pc = 1:ncol(data)),
        aes(x = pc, y = var_cum ,fill=var_cum )) +
  geom_col(width = 0.5) +
  scale_y_continuous(limits = c(0,1)) +
  theme_bw() +
  labs(x = "Principal component", y = "Proportion of cumulative explained variance")

To choose the appropriate number of principal components, the following method is used: The variances explained by the principal components are averaged, and those whose proportion of explained variance exceeds the mean are selected.

expect_pca$sdev^2
##  [1] 6.88127260 1.89853604 1.48935464 1.11362380 1.06806999 1.02571703
##  [7] 0.90132806 0.80196053 0.71961676 0.62551002 0.61876173 0.56828399
## [13] 0.53271250 0.48568438 0.39255898 0.29026670 0.22596694 0.16378861
## [19] 0.12029050 0.07669621
mean(expect_pca$sdev^2)
## [1] 1

This analysis recommends taking 6 as the optimal number of principal components. It is necessary that we only retain enough components to explain a specified significantly large percentage of the total variance in the original variables.Typically, values between 70% and 90% are suggested, although lower values may be appropriate as the sample size increases. As this dataset has almost 3000 observations, it is safe to choose 6 principal components because it represents nearly 70% (67.3828705%) of the total variance. If a higher percentage is desired, one could opt for 7 variables -as PC7’s proportion of explained variance is really close to the mean- resulting in a percentage of 71.8895108%.”

Now having chosen the optimal number of principal components, we’ll perform a reduction of the dimension via factor analysis.

4 Factor analysis

4.1 Different estimation methods

A method for extracting factors should be chosen, such as the principal factor, maximum likelihood, etc. The fa() function implements up to 6 different methods. This example compares the results of two methods.

First, two models with three factors each are created. The factor matrix for three latent factors in the two models is displayed.

#Two models with three factors.

#Model 1: factor analysis is performed in a polychoric correlation matrix.
modelo1<-fa(poly_cor,
            nfactors = 3,
            rotate = "none",
            fm="mle") #maximum-likelihood method
#It is specified that three factors are extracted, and no rotation is applied to them. The maximum likelihood method is used for the estimation of these factors.

#Model 2.
modelo2<-fa(poly_cor,
            nfactors = 3,
            rotate = "none",
            fm="minres") #minimal residual model
#The second model is analogous to the previous one, except for the last command, where the estimation of the factors is done using the minimum residuals model.

#Outputs of these models: factor weight matrices, etc. Blank entries represent null loadings.
print("Modelo 1: mle")
## [1] "Modelo 1: mle"
modelo1$loadings
## 
## Loadings:
##                        ML2    ML1    ML3   
## Year                                  0.197
## Life.expectancy         0.616 -0.232  0.645
## Adult.Mortality        -0.445  0.102 -0.487
## infant.deaths                  0.997       
## Alcohol                 0.477 -0.131  0.150
## percentage.expenditure  0.337         0.287
## Hepatitis.B                   -0.185  0.225
## Measles                        0.505       
## BMI                     0.570 -0.254  0.220
## under.five.deaths              0.997       
## Polio                   0.263 -0.189  0.410
## Total.expenditure       0.265 -0.138       
## Diphtheria              0.268 -0.195  0.424
## HIV.AIDS               -0.322        -0.357
## GDP                     0.358 -0.122  0.335
## Population                     0.541       
## thinness..1.19.years   -0.799  0.500  0.212
## thinness.5.9.years     -0.800  0.505  0.222
## Resource.Income         0.543 -0.172  0.541
## Schooling               0.562 -0.220  0.532
## 
##                  ML2   ML1   ML3
## SS loadings    3.597 3.420 2.170
## Proportion Var 0.180 0.171 0.109
## Cumulative Var 0.180 0.351 0.459
print("Modelo 2: minres")
## [1] "Modelo 2: minres"
modelo2$loadings
## 
## Loadings:
##                        MR1    MR2    MR3   
## Year                    0.169         0.101
## Life.expectancy         0.833  0.323       
## Adult.Mortality        -0.537 -0.276       
## infant.deaths          -0.519  0.819       
## Alcohol                 0.494  0.134 -0.199
## percentage.expenditure  0.466  0.234 -0.348
## Hepatitis.B             0.288         0.465
## Measles                -0.298  0.382       
## BMI                     0.632              
## under.five.deaths      -0.533  0.800       
## Polio                   0.509  0.121  0.497
## Total.expenditure       0.297              
## Diphtheria              0.531  0.126  0.579
## HIV.AIDS               -0.346 -0.204       
## GDP                     0.511  0.246 -0.305
## Population             -0.213  0.503       
## thinness..1.19.years   -0.717  0.216  0.262
## thinness.5.9.years     -0.715  0.220  0.263
## Resource.Income         0.715  0.301       
## Schooling               0.758  0.270       
## 
##                  MR1   MR2   MR3
## SS loadings    5.763 2.365 1.245
## Proportion Var 0.288 0.118 0.062
## Cumulative Var 0.288 0.406 0.469

Finally, a comparison of the communalities of these two methods is illustrated. It appears that communalities of the likelihood model (first column) are greater than those of the minimum residual model (second column).

#Comparison of the communalities.
sort(modelo1$communality,decreasing = T)->c1
sort(modelo2$communality,decreasing = T)->c2
head(cbind(c1,c2))
##                             c1        c2
## infant.deaths        0.9959985 0.9405253
## under.five.deaths    0.9959976 0.9271596
## thinness.5.9.years   0.9437952 0.8060443
## thinness..1.19.years 0.9338352 0.6494882
## Life.expectancy      0.8494845 0.6326199
## Schooling            0.6472546 0.6292941
#Comparison of uniqueness, that is, the proportion of variance that has not been explained by the factor (1-communality).
sort(modelo1$uniquenesses,decreasing = T)->u1
sort(modelo2$uniquenesses,decreasing = T)->u2
head(cbind(u1,u2),n=10)
##                               u1        u2
## Year                   0.9517615 0.9554806
## Hepatitis.B            0.9099781 0.9035363
## Total.expenditure      0.9096144 0.8311416
## percentage.expenditure 0.7943096 0.7606215
## HIV.AIDS               0.7669905 0.7014155
## GDP                    0.7452375 0.6996738
## Measles                0.7394481 0.6986724
## Alcohol                0.7329746 0.6322187
## Polio                  0.7265124 0.6075658
## Diphtheria             0.7107326 0.5896661

4.2 Appropriate number of latent factors

There are different criteria, among which the Scree plot (Cattel 1966) and parallel analysis (Horn 1965) stand out. According to the following graphical outputs, 3 is considered to be the optimal number of factors (parallel analysis), despite 6 are the appropriate according to the first graphical scree plot.

#Scree plot.
scree(poly_cor)

It is observed that there is y=1 in the plot as a reference line for random variance. Eigenvalues of the correlation matrix that are above it explain more variance than expected by chance. Therefore, according to this Scree plot, 6 would be taken as the optimal number of factors.

#Parallel analysis.
fa.parallel(poly_cor,n.obs=100,fa="fa",fm="ml")

## Parallel analysis suggests that the number of factors =  3  and the number of components =  NA

In this plot, the real data and simulated data with the model are represented by blue and red lines, respectively. It is observed that there are clearly 3 eigenvalues of the principal factors above the plot of simulated data. This indicates that 3 would be considered as the optimal number of factors.

4.3 Estimation of the factorial model

The factorial model with 5 factors implementing a varimax rotation to seek a simpler interpretation is performed.

modelo_varimax<-fa(poly_cor,nfactors = 5,rotate = "varimax",
                   fa="mle")

#The rotated factorial matrix is shown.
print(modelo_varimax$loadings,cut=0) 
## 
## Loadings:
##                        MR1    MR2    MR5    MR4    MR3   
## Year                    0.017 -0.020  0.179  0.040  0.125
## Life.expectancy         0.369 -0.061  0.805  0.238  0.301
## Adult.Mortality        -0.184  0.001 -0.708 -0.115 -0.114
## infant.deaths          -0.172  0.977 -0.042 -0.029 -0.098
## Alcohol                 0.519  0.020  0.079  0.263  0.147
## percentage.expenditure  0.205 -0.029  0.112  0.902  0.008
## Hepatitis.B             0.030 -0.132  0.077 -0.025  0.529
## Measles                -0.085  0.475 -0.053 -0.026 -0.108
## BMI                     0.514 -0.112  0.363  0.103  0.156
## under.five.deaths      -0.171  0.971 -0.065 -0.029 -0.117
## Polio                   0.172 -0.069  0.196  0.078  0.700
## Total.expenditure       0.324 -0.062  0.014  0.095  0.097
## Diphtheria              0.169 -0.065  0.180  0.060  0.828
## HIV.AIDS               -0.043  0.004 -0.657  0.019 -0.020
## GDP                     0.197 -0.045  0.180  0.873  0.059
## Population             -0.074  0.541  0.053  0.002 -0.004
## thinness..1.19.years   -0.845  0.333 -0.187 -0.027  0.003
## thinness.5.9.years     -0.835  0.339 -0.190 -0.028  0.007
## Resource.Income         0.421  0.000  0.462  0.322  0.287
## Schooling               0.485 -0.025  0.430  0.321  0.316
## 
##                  MR1   MR2   MR5   MR4   MR3
## SS loadings    2.846 2.692 2.355 1.957 1.851
## Proportion Var 0.142 0.135 0.118 0.098 0.093
## Cumulative Var 0.142 0.277 0.395 0.492 0.585

The following diagrammatic representation is used to illustrate with which variables each of the factors is correlated.

fa.diagram(modelo_varimax,space=15)

Another way to do the previous analysis.

#This function only performs the mle method (maximum-likelihood estimator method).
FA<-factanal(expect_num,factors=5, rotation="varimax")
FA$loadings
## 
## Loadings:
##                        Factor1 Factor2 Factor3 Factor4 Factor5
## Year                    0.154                           0.116 
## Life.expectancy         0.882           0.273   0.208   0.263 
## Adult.Mortality        -0.680          -0.169  -0.112         
## infant.deaths                   0.974  -0.188                 
## Alcohol                 0.223           0.385   0.253   0.135 
## percentage.expenditure  0.154           0.164   0.925         
## Hepatitis.B                    -0.116                   0.552 
## Measles                         0.490                  -0.103 
## BMI                     0.416  -0.110   0.455   0.103   0.150 
## under.five.deaths               0.972  -0.184          -0.113 
## Polio                   0.247           0.131           0.704 
## Total.expenditure                       0.252   0.100         
## Diphtheria              0.226           0.137           0.832 
## HIV.AIDS               -0.599                                 
## GDP                     0.203           0.169   0.893         
## Population                      0.542                         
## thinness..1.19.years   -0.201   0.290  -0.899                 
## thinness.5.9.years     -0.196   0.296  -0.897                 
## Resource.Income         0.559           0.283   0.283   0.249 
## Schooling               0.563           0.314   0.276   0.270 
## 
##                Factor1 Factor2 Factor3 Factor4 Factor5
## SS loadings      2.759   2.653   2.499   1.973   1.811
## Proportion Var   0.138   0.133   0.125   0.099   0.091
## Cumulative Var   0.138   0.271   0.396   0.494   0.585

It is observed that the code provides the factor loadings, indicating the relationships between the original variables and the extracted factors. The closer to 1 or -1, the stronger the relationship, and the closer to 0, the weaker. Additionally, if the number is positive, it indicates a direct relationship, and if negative, an inverse one.

5 Linear Discriminant Analysis (LDA)

The dataset contains information on various social indicators measured in different countries, including the country’s life expectancy indicator. Therefore, it would be interesting to provide a classification method for the lifespan based on the other indicators. To do this, a qualitative variable is defined, which will be the response variable for the classification, called ‘life.’ This variable has two categories:

  • Long life: if it is greater than the mean.
  • Short life: if it is below the mean.
life=c()
mean = mean(data$Life.expectancy)

for(k in 1:nrow(data)){
  if(data$Life.expectancy[k]<mean){
  life[k] = "short"
  }
  else
    life[k] = "long"
}

life<-as.factor(life)

5.1 Graphical exploration of data

First, we explore how well (or poorly) each of the explanatory variables considered independently classifies the life expectancy. To do this, we draw the superimposed histograms. If the histograms are separated, the variable considered would be a good individual classifier for the species.

#Create individual plots for each variable.
p1 <- ggplot(data = data, aes(x = Year, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Year")

p2 <- ggplot(data = data, aes(x = Adult.Mortality, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Adult.Mortality")

p3 <- ggplot(data = data, aes(x = infant.deaths, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of infant.deaths")

p4 <- ggplot(data = data, aes(x = Alcohol, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Alcohol")

p5 <- ggplot(data = data, aes(x = percentage.expenditure, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of percentage.expenditure")

p6 <- ggplot(data = data, aes(x = Hepatitis.B, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Hepatitis.B")

p7 <- ggplot(data = data, aes(x = Measles, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Measles")

p8 <- ggplot(data = data, aes(x = BMI, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of BMI")

p9 <- ggplot(data = data, aes(x = under.five.deaths, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of under.five.deaths")

p10 <- ggplot(data = data, aes(x = Polio, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Polio")

p11<- ggplot(data = data, aes(x = Total.expenditure, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Total.expenditure")

p12 <- ggplot(data = data, aes(x = Diphtheria, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Diphtheria")

p13 <- ggplot(data = data, aes(x = HIV.AIDS, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of HIV.AIDS")

p14 <- ggplot(data = data, aes(x = GDP, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of GDP")

p15<- ggplot(data = data, aes(x = Population, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Population")

p16<- ggplot(data = data, aes(x = thinness..1.19.years , fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of thinness..1.19.years")

p17<- ggplot(data = data, aes(x = thinness.5.9.years, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of thinness.5.9.years")

p18<- ggplot(data = data, aes(x = Resource.Income, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Resource.Income")

p19<- ggplot(data = data, aes(x = Schooling, fill = life)) +
  geom_histogram(position = "identity", alpha = 0.5) +
  labs(title = "Histogram of Schooling")

p1;p2;p3;p4;p5;p6;p7;p8;p9;p10;p11;p12;p13;p14;p15;p16;p17;p18;p19

The variable Schooling is the one that best differentiates between long and short lifespan (least overlapping).

5.2 Univariate and multivariate normality

Next we make a graphical exploration of the normality of the univariate distributions of our predictors by representing the histograms and performing a Saphiro test.

Univariate histograms

# Histogram representation of each variable.
par(mfcol = c(2, 2))

# Loop through selected variables
for (variable in 1:20) {
  # Create a histogram
  hist_data <- data[[variable]]
  hist(hist_data, probability = TRUE, col = grey(0.8), main = paste("Histogram of", variable), xlab = variable)

  # Overlay a normality curve
  curve(dnorm(x, mean = mean(hist_data), sd = sd(hist_data)), col = "blue", lwd = 2, add = TRUE)
}

# Reset the layout to default
par(mfcol = c(1, 1))

Qqplots

par(mfrow=c(2,2))
for (k in 1:19) {
  j0 <- names(data)[k]
  x0 <- seq(min(data[, k]), max(data[, k]), le = 50)
  for (i in 1:2) {
    i0 <- levels(life)[i]
    x <- data[life == i0, j0]
    qqnorm(x, main = paste("life expectancy", i0, j0), pch = 19, col = i + 1)
    qqline(x)
  }
}

Based on the graphical techniques employed, it is evident that we will not achieve univariate normality in our variables.

Univariate normality test (Shapiro-Wilks)

The null hypothesis that the data follow a univariate normal distribution is tested. This hypothesis is rejected if the p-value given by the Shapiro-Wilks test is lesser than 0.05. Otherwise the assumption of normality of the data is not rejected.

datos1<- data
datos1[,2]=life
datos_tidy <- melt(datos1, value.name = "value")

result <- aggregate(value ~ Life.expectancy + variable, data = datos_tidy,
                    FUN = function(x) shapiro.test(x)$p.value)
result

The hypothesis of normality is rejected as the p-value given by the Shapiro-Wilks test is lesser than 0.05.

5.3 Multivariate normality

We encounter trouble as the MVN function can only be used with less than 2000 observations and we have more. An Energy Test is another statistical test that determines whether or not a group of variables follows a multivariate normal distribution. The null and alternative hypotheses for the test are as follows:

  • H0 (null): The variables follow a multivariate normal distribution.

  • Ha (alternative): The variables do not follow a multivariate normal distribution.

#Perform the Henze-Zirkler Multivariate Normality Test
mvnorm.etest(data, R=100)
## 
##  Energy test of multivariate normality: estimated parameters
## 
## data:  x, sample size 2938, dimension 20, replicates 100
## E-statistic = 31.93, p-value < 2.2e-16

Given that our p-value < 0.05, we’ll reject the null hypothesis. Therefore, we have evidence to suggest that the variables in the dataset do not follow a multivariate normal distribution. We can also use the following code:

hz_test <- mvn(data = data, mvnTest = "hz")
hz_test$multivariateNormality

5.4 Homogeneity of variance

When there is a single predictor the most recommended test is the Barttlet test.When multiple predictors are used, it must be verified that the covariance matrix is constant in all groups. In this case it is also advisable to check the homogeneity of the variance for each predictor at the individual level.

The most recommended test is the Box M test, which is an extension of the Barttlet test for multivariate scenarios. It must be taken into account that it is very sensitive to whether the data are actually distributed according to a multivariate normal. For this reason, it is recommended to use a significance p-value < 0.001 to reject the null hypothesis.

The null hypothesis to be tested is that of equality of covariance matrices in all groups.

datos<- data[,-2]
boxM(data = data[, 1:length(datos)], grouping = life)
## 
##  Box's M-test for Homogeneity of Covariance Matrices
## 
## data:  data[, 1:length(datos)]
## Chi-Sq (approx.) = 8882.8, df = 190, p-value < 2.2e-16

In this case we do reject the null hypothesis since p-value < 0.001 and therefore we can’t assume homogeneity of variances.

It is important to remember that for this conclusion to be reliable the assumption of multivariate normality must be met, which, wasn’t the case. In fact, when there is no multivariate normal distribution, this test always comes out significant and therefore is not reliable.

5.5 Discriminant function

Now we fit a linear discriminant classification model, but we’ll have to keep in mind that the assumptions of multivariate normality and homogeneity of variance aren’t satisfied.

modelo_lda <- lda(formula = life ~ Year + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure+ Hepatitis.B + Measles + BMI + under.five.deaths+ Polio + Total.expenditure  + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Resource.Income + Schooling ,data = data)

modelo_lda
## Call:
## lda(life ~ Year + Adult.Mortality + infant.deaths + Alcohol + 
##     percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + 
##     Population + thinness..1.19.years + thinness.5.9.years + 
##     Resource.Income + Schooling, data = data)
## 
## Prior probabilities of groups:
##      long     short 
## 0.5707965 0.4292035 
## 
## Group means:
##           Year Adult.Mortality infant.deaths  Alcohol percentage.expenditure
## long  2007.776        101.9585      1.920391 5.916319              106.46302
## short 2007.176        212.5657      4.757941 2.825644               52.68684
##       Hepatitis.B  Measles      BMI under.five.deaths    Polio
## long     90.38385 25.91884 47.21801          3.007014 94.31923
## short    84.37839 48.13386 26.48947          7.468710 90.49733
##       Total.expenditure Diphtheria  HIV.AIDS      GDP Population
## long           6.080152   94.21127 0.1126262 4004.353    6040694
## short          5.202103   90.26405 0.1856046 2259.682    5980829
##       thinness..1.19.years thinness.5.9.years Resource.Income Schooling
## long              2.841068           2.784720       0.7552718  13.74616
## short             5.858415           5.935501       0.5253868  10.38843
## 
## Coefficients of linear discriminants:
##                                  LD1
## Year                    1.132738e-02
## Adult.Mortality         4.340997e-03
## infant.deaths           1.262522e-02
## Alcohol                -1.970419e-02
## percentage.expenditure -9.283862e-04
## Hepatitis.B            -1.582141e-02
## Measles                 2.458324e-03
## BMI                    -4.536101e-03
## under.five.deaths      -1.064769e-03
## Polio                  -3.381613e-03
## Total.expenditure      -3.877981e-02
## Diphtheria             -3.881977e-03
## HIV.AIDS                5.853773e+00
## GDP                    -1.935186e-05
## Population             -1.312976e-08
## thinness..1.19.years   -8.493223e-02
## thinness.5.9.years      1.001592e-01
## Resource.Income        -4.656544e+00
## Schooling              -6.379008e-02

The output of this object shows us the prior probabilities of each group, in this case 0.5707965 and 0.4292035 for long and short, respectively, the means of each regressor per group and the coefficients of the linear discriminant classification model, which in this case would have the form:

odds = 1.132738e-02Year+ 4.340997e-03Adult.Mortality - 1.262522e-02infant.deaths - 1.970419e-02Alcohol - 9.283862e-04percentage.expenditure - 1.582141e-02Hepatitis.B - 2.458324e-03Measles - 4.536101e-03BMI - 1.064769e-03under.five.deaths - 3.381613e-03Polio - 3.877981e-02Total.expenditure - 3.881977e-03Diphtheria + 5.853773e+00HIV.AIDS - 1.935186e-05GDP - 1.312976e-08Population - 8.493223e-02thinness..1.19.years + 1.001592e-01thinness.5.9.years - 4.656544e+00Resource.Income - 6.379008e-02Schooling

Once the classifier is built, we can classify new data based on its measurements by simply calling the predict function.

5.6 Model validation

The confusionmatrix function from the biotools package performs cross-validation of the classification model.

pred <- predict(modelo_lda, dimen = 1)
confusionmatrix(life, pred$class)
##       new long new short
## long      1570       107
## short      213      1048
#Percentage of classification errors
trainig_error <- mean(life != pred$class) * 100
paste("trainig_error=", trainig_error, "%")
## [1] "trainig_error= 10.8917631041525 %"

In this case the correct classifications rate is 89,11% approximately.

5.7 Displaying rankings

From a geometric point of view, linear discriminant analysis separates space using a straight line. In this sense, the partimat function of the klaR package allows us to represent the classification limits of a linear or quadratic discriminant model for each pair of predictors. Each color represents a classification region according to the model, the centroid of each region and the real value of the observations are shown.

partimat(life ~ Year + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure+ Hepatitis.B + Measles + BMI + under.five.deaths+ Polio + Total.expenditure  + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Resource.Income + Schooling,
         data = data, method = "lda", prec = 200,
         image.colors = c("green", "orange"),
         col.mean = "yellow",nplots.vert = 1, nplots.hor=3)

Unlike the classification with the explanatory variables, which has an error of 10.89176%, when considering the classification according to each pair of variables, larger errors are made, from 0.106 to 0.429. The error rate has the majority values in the rank 0.17 and 0.3.

In view of the graphs, it is observed that the pair of predictors that best classifies the data is Adult.Mortality and Resource.Income. If a classification model with two predictors were to be built, this pair would be considered.

6 Quadratic Discriminant Analysis (QDA)

Just like for Linear Discriminant Analysis, to perform Quadratic Discriminant Analysis, we start with the graphical exploration of data and checks on univariate and multivariate normality and homogeneity of variances, which have already been conducted earlier.

6.1 Discriminant function

Although the assumption of multivariate normality is not met, considering that variances are not homogeneous, a quadratic discriminant model is fitted because it is robust against the lack of this assumption. However, it should be noted, given the possibility of obtaining unexpected results.

modelo_qda<- qda(formula = life ~ Year + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure+ Hepatitis.B + Measles + BMI + under.five.deaths+ Polio + Total.expenditure  + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Resource.Income + Schooling,data = data)
modelo_qda
## Call:
## qda(life ~ Year + Adult.Mortality + infant.deaths + Alcohol + 
##     percentage.expenditure + Hepatitis.B + Measles + BMI + under.five.deaths + 
##     Polio + Total.expenditure + Diphtheria + HIV.AIDS + GDP + 
##     Population + thinness..1.19.years + thinness.5.9.years + 
##     Resource.Income + Schooling, data = data)
## 
## Prior probabilities of groups:
##      long     short 
## 0.5707965 0.4292035 
## 
## Group means:
##           Year Adult.Mortality infant.deaths  Alcohol percentage.expenditure
## long  2007.776        101.9585      1.920391 5.916319              106.46302
## short 2007.176        212.5657      4.757941 2.825644               52.68684
##       Hepatitis.B  Measles      BMI under.five.deaths    Polio
## long     90.38385 25.91884 47.21801          3.007014 94.31923
## short    84.37839 48.13386 26.48947          7.468710 90.49733
##       Total.expenditure Diphtheria  HIV.AIDS      GDP Population
## long           6.080152   94.21127 0.1126262 4004.353    6040694
## short          5.202103   90.26405 0.1856046 2259.682    5980829
##       thinness..1.19.years thinness.5.9.years Resource.Income Schooling
## long              2.841068           2.784720       0.7552718  13.74616
## short             5.858415           5.935501       0.5253868  10.38843

The output of this object shows us the prior probabilities of each group, in this case 0.5707965 and 0.4292035 and the means of each regressor per group.

Once the classifier is built, we can classify new data based on its measurements by simply calling the predict function.

6.2 Validación cruzada

The confusionmatrix function from the biotools package performs cross-validation of the classification model.

pred <- predict(modelo_qda, dimen = 1)
confusionmatrix(life, pred$class)
##       new long new short
## long      1578        99
## short      208      1053
trainig_error <- mean(life != pred$class) * 100
paste("trainig_error=", trainig_error, "%")
## [1] "trainig_error= 10.4492852280463 %"

In this case the correct classifications rate is 89.56% approximately, a little over the LDA case.

6.3 Displaying rankings

The partimat function of the klaR package allows us to represent the classification limits of a linear or quadratic discriminant model for each pair of predictors. Each color represents a classification region according to the model, the centroid of each region and the real value of the observations are shown.

partimat(life ~ Year + Adult.Mortality + infant.deaths + Alcohol + percentage.expenditure+ Hepatitis.B + Measles + BMI + under.five.deaths+ Polio + Total.expenditure  + Diphtheria + HIV.AIDS + GDP + Population + thinness..1.19.years + thinness.5.9.years + Resource.Income + Schooling,
         data = expect_num, method = "qda", prec = 200,
         image.colors = c("darkgoldenrod1", "snow2"),
         col.mean = "firebrick",nplots.vert =1, nplots.hor=3)